ABSTRACT

The discovery of frequent itemsets can serve valuable eco- 
nomic and research purposes. Releasing discovered frequent 
itemsets, however, presents privacy challenges. In this pa- 
per, we study the problem of how to perform frequent item- 
set mining on transaction databases while satisfying differ- 
ential privacy. We propose an approach, called PrivBasis, 
which leverages a novel notion called basis sets. A θ-basis
set has the property that any itemset with frequency higher
than θ is a subset of some basis. We introduce algorithms
for privately constructing a basis set and then using it to 
find the most frequent itemsets. Experiments show that our 
approach greatly outperforms the current state of the art. 

1. INTRODUCTION 

Frequent itemset mining is a well recognized data min- 
ing problem. The discovery of frequent itemsets can serve 
valuable economic and research purposes, e.g., mining as- 
sociation rules [5], predicting user behavior [3], and find- 
ing correlations [11]. Publishing frequent itemsets, however, 
may reveal information about individual transactions,
compromising their privacy.

In this paper, we study the problem of how to perform fre- 
quent itemset mining (FIM) on transaction databases while 
satisfying differential privacy. Differential privacy [17] is an 
appealing privacy notion which provides worst-case privacy 
guarantees. In recent years, it has become the de facto stan- 
dard notion of privacy for research in private data analysis. 
The key challenge in private FIM is that the dimensionality 
of transactional datasets is very high. While effective tech- 
niques for differentially private data publishing have been 
developed for low-dimensional datasets (e.g., [23, 33]), these 
techniques do not apply to high-dimensional data. In fact, 
even for the weaker privacy notion of k-anonymity, the curse
of dimensionality is well known [4].

Our work is inspired by Bhaskar et al.'s KDD'10 paper [8],
in which they propose an approach to privately publish top k 
frequent itemsets and their frequencies. Their approach first
selects k itemsets from the set of all itemsets that include
at most m items, and then adds noise to the frequencies 
of these selected itemsets. This approach works reasonably 
well for small k values; however for larger values of k, the 
accuracy is poor. The main reason is that for larger k values, 
one has to set the size limit m to be larger (e.g., 3, 4, or 
higher). This results in a very large candidate set from which 
the algorithm must select the top k, making the selection 
inaccurate. 

In this paper we propose a novel approach that avoids the 
selection of top k itemsets from a very large candidate set. 
More specifically, we introduce the notion of basis sets. A 
θ-basis set B = {B_1, B_2, ..., B_w}, where each B_i is a set
of items, has the property that any itemset with frequency
higher than θ is a subset of some basis B_i. A good basis set
is one where w is small and the lengths of all B_i's are also
small. Given a good basis set B, one can reconstruct the 
frequencies of all subsets of Bi's with good accuracy. One 
can then select the most frequent itemsets from these. We 
also introduce techniques to construct good basis sets while 
satisfying differential privacy. Finally, we have conducted 
extensive experiments, and the results show that our ap- 
proach greatly outperforms the existing approach. 

We call our approach PrivBasis. It meets the challenge of 
high dimensionality by projecting the input dataset onto a 
small number of selected dimensions that one cares about. 
In fact, PrivBasis often uses several sets of dimensions for 
such projections, to avoid any one set containing too many 
dimensions. Each basis in B corresponds to one such set 
of dimensions for projection. Our techniques enable one 
to select which sets of dimensions are most helpful for the 
purpose of finding the k most frequent itemsets. 

The rest of this paper is organized as follows. The next 
section introduces the background knowledge about differ- 
ential privacy and frequent itemset mining. In Section 3 
we analyze the state of the art on private FIM and identify 
the challenges in this problem. Our approach is presented 
in Section 4. We report experimental results in Section 5. 
Section 6 reviews related work and Section 7 concludes our 
work. 

2. PRELIMINARIES 
2.1 Differential Privacy 

Informally, differential privacy requires that the output of 
a data analysis mechanism be approximately the same, even 
if any single tuple in the input database is arbitrarily added 
or removed.

Definition 1 (ε-Differential Privacy [16, 17]).
A randomized mechanism A gives ε-differential privacy if
for any pair of neighboring datasets D and D', and any
S ∈ Range(A),

\[ \Pr[A(D) = S] \le e^{\varepsilon} \cdot \Pr[A(D') = S]. \]

In this paper we consider two datasets D and D' to be
neighbors if and only if either D = D' + t or D' = D + t,
where D + t denotes the dataset resulting from adding the
tuple t to the dataset D. We use D ≃ D' to denote this.
This protects the privacy of any single tuple, because adding
or removing any single tuple results in e^ε-multiplicative-bounded
changes in the probability distribution of the output.
If an adversary can make a certain inference about a tuple
based on the output, then the same inference is also likely
to occur even if the tuple does not appear in the dataset.

Differential privacy is composable in the sense that combining
multiple mechanisms that satisfy differential privacy
for ε_1, ..., ε_m results in a mechanism that satisfies
ε-differential privacy for ε = Σ_i ε_i. Because of this, we refer
to ε as the privacy budget of a privacy-preserving data analysis
task. When a task involves multiple steps, each step
uses a portion of ε so that the sum of these portions is no
more than ε.

There are several approaches for designing mechanisms
that satisfy ε-differential privacy. In this paper we use two
of them. The first approach computes a function g on the
dataset D in a differentially private way, by adding
random noise to g(D). The magnitude of the noise depends
on GS_g, the global sensitivity or the L_1 sensitivity of g. Such
a mechanism A_g is given below:

\[ A_g(D) = g(D) + \mathrm{Lap}\left(\frac{GS_g}{\varepsilon}\right), \]
\[ \text{where } GS_g = \max_{(D,D') : D \simeq D'} |g(D) - g(D')|, \]
\[ \text{and } \Pr[\mathrm{Lap}(\beta) = x] = \frac{1}{2\beta} e^{-|x|/\beta}. \]

In the above, Lap(β) denotes a random variable sampled
from the Laplace distribution with scale parameter β. This
is generally referred to as the Laplacian mechanism for
satisfying differential privacy.
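
As an illustration of the Laplacian mechanism, the following is a minimal Python sketch, not the authors' implementation. The function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the noise is drawn as the difference of two exponential samples, which is one standard way to obtain a Laplace variate with a given scale.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale).

    The difference of two independent exponential variates with mean
    `scale` follows a Laplace distribution with scale `scale`.
    """
    if scale <= 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return g(D) + Lap(GS_g / epsilon) for a numeric answer g(D)."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # A count query has sensitivity 1 under add/remove-one-tuple neighbors.
    print(laplace_mechanism(true_value=1192, sensitivity=1.0, epsilon=0.5, rng=rng))
```

The noise scale GS_g/ε grows as the budget ε shrinks, so a step that receives a small share of the budget returns a noisier answer.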

The second approach computes a function g on a dataset
D by sampling from the set of all possible answers in the
range of g according to an exponential distribution, so that
answers that are "more accurate" are sampled with higher
probability. This is generally referred to as the exponential
mechanism [28]. This approach requires the specification of
a quality function q : 𝒟 × ℛ → ℝ, where the real-valued
score q(D, r) indicates how accurate it is to return r when
the input dataset is D. Higher scores indicate more accurate
outputs, which should be returned with higher probabilities.
Given the quality function q, its global sensitivity GS_q is
defined as:

\[ GS_q = \max_{r} \max_{(D,D') : D \simeq D'} |q(D,r) - q(D',r)| \]

The following method M satisfies ε-differential privacy:

\[ \Pr[M(D) = r] \propto \exp\left(\frac{\varepsilon}{2\,GS_q}\, q(D,r)\right) \qquad (1) \]

For example, if q(D, r_1) − q(D, r_2) = 1, then r_1 should be
returned y times more likely than r_2, with y = exp(ε/(2 GS_q)).
The larger the exponent ε/(2 GS_q) is, the more likely that M will
return the higher quality result.
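
The sampling rule of the exponential mechanism can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the outcomes and their quality scores are given as a dictionary, and the helper names are hypothetical. Scores are shifted by their maximum before exponentiation purely for numerical stability; the shift cancels in the normalization.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Selection probabilities proportional to exp(eps * q / (2 * GS_q))."""
    top = max(scores.values())
    weights = {
        r: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
        for r, q in scores.items()
    }
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample one outcome according to the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    outcomes = list(probs)
    return rng.choices(outcomes, weights=[probs[r] for r in outcomes], k=1)[0]


if __name__ == "__main__":
    scores = {"r1": 3.0, "r2": 2.0}
    probs = exponential_mechanism_probs(scores, epsilon=1.0, sensitivity=1.0)
    # The ratio equals exp(eps / (2 * GS_q)) when the scores differ by 1.
    print(probs["r1"] / probs["r2"], math.exp(0.5))
```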



Symbol   Description
D        The transaction dataset
N        The number of transactions in D
I        The set of items
B        The basis set
f(X)     The frequency of itemset X
λ        The number of unique items in the set of top-k itemsets
f_k      The frequency of the k-th most frequent itemset

Table 1: The notations



As pointed out in [28], in some cases the quality function
satisfies the condition that when the input dataset is
changed from D to D', the quality values of all outcomes
change only in one direction, i.e.,

\[ \forall D \simeq D' : \left[ \left( \exists r_1 : q(D,r_1) < q(D',r_1) \right) \rightarrow \left( \forall r_2 : q(D,r_2) \le q(D',r_2) \right) \right] \]

Then one can remove the factor of 1/2 in the exponent of
Equation (1) and return r with probability proportional to
exp((ε/GS_q) q(D, r)). This improves the accuracy of the result.

2.2 Frequent Itemset Mining 

Frequent itemset mining (FIM) is a well studied problem
in data mining. It aims at discovering the itemsets that
frequently appear in a transactional dataset. More formally,
let I be a set of items and let D = [t_1, t_2, ..., t_N] be a
transaction dataset where t_i ⊆ I, and let N be the number of
transactions in D. The frequency of an itemset X ⊆ I, denoted
by f(X), is the fraction of transactions in D that include
X as a subset. Thus 0 ≤ f(X) ≤ 1 for any X. Given a
frequency threshold θ such that 0 ≤ θ ≤ 1, we say that an
itemset X is θ-frequent when f(X) ≥ θ.

The FIM problem can be defined as either taking the minimal
frequency θ as input and returning all θ-frequent itemsets
together with their frequencies, or as taking an integer
k as input, and returning the top k most frequent itemsets,
together with their frequencies. One can easily convert one
version to the other.

Several algorithms have been proposed for finding fre- 
quent itemsets. The two most prominent ones are the Apri- 
ori algorithm [5], and the FP-Growth algorithm [22]. The 
Apriori algorithm exploits the observation that if an item- 
set X is frequent, then all its subsets must also be frequent. 
The algorithm works by generating itemsets of length n from 
itemsets of length n — 1, eliminating candidates that have 
an infrequent pattern. The FP-Growth algorithm skips the 
candidate itemset generation process by using a compact 
tree structure to store itemset frequency information. 
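
To make the frequency definition and the Apriori idea concrete, the following is a small, non-private Python sketch. It is a simplified illustration rather than the algorithm of [5] as published: candidates of the next length are formed by joining frequent itemsets of the current length, and a candidate is pruned as soon as one of its subsets is not frequent. The toy transactions are invented for the example.

```python
from itertools import combinations


def frequency(itemset, transactions):
    """f(X): fraction of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def apriori(transactions, theta):
    """Return {itemset: frequency} for all theta-frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    level = {frozenset([x]) for x in items}
    frequent = {}
    while level:
        # Keep the candidates of this length that reach the threshold.
        kept = {}
        for candidate in level:
            f = frequency(candidate, transactions)
            if f >= theta:
                kept[candidate] = f
        frequent.update(kept)
        # Join step: unions of two kept sets that are exactly one item longer.
        size = len(next(iter(level))) + 1
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        level = {
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        }
    return frequent


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
    for itemset, f in sorted(apriori(data, 0.5).items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), f)
```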

3. THE EXISTING APPROACH 

This paper is inspired by Bhaskar et al.'s paper [8] in
KDD'10, which proposed an approach for releasing the top
k itemsets of a predefined length m. That is, among all
itemsets of length exactly m, one chooses the k most frequent
ones, and releases their frequencies. This method can
be easily extended to the case of releasing top k itemsets of
length at most m, instead of exactly m. This can then be
used to return top k most frequent itemsets by choosing an
appropriate m. To the best of our knowledge, this is the only
existing approach that publishes frequent itemsets while
satisfying differential privacy. A key concept introduced in this
approach is the notion of Truncated Frequencies; we thus
call this approach the TF method.

The TF method has two steps. In the first step, one selects
which k itemsets to release. In the second step, one
releases the frequencies of these itemsets after adding noise
to them. The privacy budget ε is evenly divided between
these two steps. The second step is straightforward. Given
k itemsets, releasing their frequencies has sensitivity k/N, as
adding or removing one transaction can affect the frequency
of each itemset by at most 1/N. Thus adding noise according
to Lap(2k/(εN)) to the frequency of each of the k itemsets
satisfies (ε/2)-differential privacy.

The main challenge lies in the first step, namely, selecting 
the k most frequent itemsets. The TF method selects these 
from the set of all itemsets of length at most m; we use U to 
denote this set of candidate itemsets. The number of these 
candidate itemsets is

\[ |U| = \sum_{i=1}^{m} \binom{|I|}{i} \qquad (2) \]

which is exponential in m.

Because |U| is large, enumerating through all elements in
U can be computationally expensive. The key novelty in the
TF approach is to use the notion of truncated frequencies
to avoid explicitly enumerating through all itemsets in U.
The truncated frequency of X ∈ U is defined as f̂(X) =
max(f(X), f_k − γ), where f_k is the frequency of the k-th
most frequent itemset in U, and γ is a parameter computed
using Equation (3) below.

The intuition is that for an itemset with frequency below
f_k − γ, one does not need to explicitly consider the itemset;
instead, it suffices to use f_k − γ as the upper bound of
the itemset's frequency. The parameter γ must be selected
to ensure that an itemset of frequency less than f_k − γ is
selected only with low probability; it is computed as follows:

\[ \gamma = \frac{4k}{\varepsilon N} \left( \ln\frac{k}{\rho} + \ln|U| \right) \qquad (3) \]

The value ρ bounds the error probability and should be between
0 and 1.

Two methods were proposed to select the k most frequent
itemsets from U using the truncated frequencies. The first
is to add Lap(4k/(εN)) to the truncated frequencies of all
itemsets in U, and then select the k with highest noisy
frequencies. The second method is to use repeated applications of
the Exponential Mechanism. One samples k times without
replacement, such that the probability of selecting an itemset
X is proportional to exp((εN/(4k)) f̂(X)). It is shown that
both methods satisfy (ε/2)-differential privacy. For both
methods, the algorithm explicitly considers only the itemsets
with frequencies > f_k − γ, and estimates the probability
that it should select an itemset whose frequency is < f_k − γ
(i.e., with truncated frequency = f_k − γ), and then randomly
samples such an itemset.

Furthermore, it is proven that the output of either method
provides the following utility guarantee: With probability
1 − ρ, every itemset with true frequency at least f_k + γ is
selected, and every selected itemset has true frequency at
least f_k − γ.



3.1 Analysis of the TF Method

The TF method works well when k is small. However, 
it scales poorly with larger k. To see why this is the case, 
recall that the TF method enumerates only itemsets with 
frequencies above f_k − γ to prune the search space. When
f_k − γ < 0, i.e., when γ > f_k, this technique results in no
pruning at all, and the algorithm degenerates into explicitly
enumerating through all elements of U. Furthermore,
the proven utility guarantee (e.g., every selected itemset has
frequency at least f_k − γ) is meaningless when f_k − γ < 0.

Unfortunately, as Table 2(b) shows, in many datasets with
large k (k ≥ 100, or k ≥ 200), we have γ larger than, or very
close to, f_k. To see why, observe that

\[ \gamma = \frac{4k}{\varepsilon N}\left(\ln\frac{k}{\rho} + \ln|U|\right) > \frac{4k}{\varepsilon N}\ln|U| \approx \frac{4km\ln|I|}{\varepsilon N} \]
That is, γ grows linearly in km. When k is large, the
top k itemsets likely include many itemsets of sizes 3, 4, or
higher. If one chooses a small m, then one misses all frequent
itemsets that are of size greater than m. If one chooses a larger
m, e.g., 3 or higher, then the γ value is too large, rendering
the mechanism infeasible.
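
The growth of γ can be checked numerically. The sketch below evaluates γ · N = (4k/ε)(ln(k/ρ) + ln|U|), with |U| computed as the number of non-empty itemsets of length at most m. The function names are hypothetical, and the value ε = 1 used in the example is an assumption: this section does not state which ε underlies Table 2(b), but with ε = 1, ρ = 0.9 and the retail parameters the result agrees with the tabulated γ · N of about 5768, far above f_k · N = 1192.

```python
import math


def candidate_count(num_items, m):
    """|U|: number of non-empty itemsets with at most m items."""
    return sum(math.comb(num_items, i) for i in range(1, m + 1))


def gamma_times_n(k, epsilon, rho, num_candidates):
    """gamma * N = (4k / eps) * (ln(k / rho) + ln|U|)."""
    return 4.0 * k / epsilon * (math.log(k / rho) + math.log(num_candidates))


if __name__ == "__main__":
    # retail: |I| = 16470, k = 100, m = 1; epsilon = 1.0 is an assumption.
    for m in (1, 2, 3):
        size = candidate_count(16470, m)
        print(m, size, round(gamma_times_n(100, 1.0, 0.9, size)))
```

Each increment of m adds roughly (4k/ε) ln|I| to γ · N, which is the linear growth in km noted above.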

We observe that a deeper reason why the TF method does 
not scale is that when one needs to select the top k itemsets 
from a large set U of candidates, the large size of U causes 
two difficulties. The first is regarding the running time, i.e.,
a large |U| makes enumerating through all elements in U
infeasible. The second difficulty is about accuracy, i.e., a
large |U| makes the selection of top k candidates from U
inaccurate. Even if every single low-frequency itemset in U
is chosen with only a small probability, the sheer number of
such low-frequency itemsets means that the k selected itemsets
likely include many infrequent ones. The TF technique
tries to address the running time challenge by pruning the
search space, but it does not address the accuracy challenge.
It thus addresses only one symptom caused by a large candidate
set, but not the root cause. In the end, even the goal
of improving running time cannot be achieved when |U| is
large, because the accuracy requirement forces a large γ.

4. THE PRIVBASIS METHOD 

In this section, we introduce the PrivBasis method for 
publishing the top k frequent itemsets. If one desires to
publish all itemsets above a given threshold θ, one can compute
the value k such that the k-th most frequent itemset
has frequency ≥ θ and the (k+1)-th itemset has frequency
< θ, and then use PrivBasis to find the top k frequent
itemsets.

4.1 Overview of PrivBasis 

We observe that the key challenge of dealing with trans- 
action datasets is their high dimensionality. The PrivBasis 
approach can be viewed as meeting the challenge by project- 
ing the input dataset D onto lower dimensions. For example, 
let B be the set of ℓ most frequent items in D; projecting
D to items in B means removing from every transaction
all the items that are not in B. The ℓ items in B can be
viewed as ℓ binary attributes that partition the dataset into
2^ℓ bins. Using the standard Laplacian mechanism, one can
obtain the noisy frequency of each bin, through which one
can reconstruct the frequencies of all subsets of B. For this
method to work, the value ℓ cannot be much larger than

dataset      N        |I|       avg |t|   k     λ     λ_2   λ_3
retail       88162    16470     11.3      100   38    37    21
mushroom     8124     119       24        100   11    30    36
pumsb-star   49046    2088      50        200   17    31    50
kosarak      990002   41270     8.1       200   39    84    58
AOL          647377   2290685   34        200   171   29    0

(a) Dataset parameters: avg |t| is the average transaction
length, λ is the number of unique items in the top k itemsets,
and λ_2 (λ_3, resp.) is the number of pairs (size-3
itemsets, resp.) in the top k itemsets. Note that we choose
ρ = 0.9, which requires utility guarantee only with the low
probability of 1 − ρ = 0.1.

dataset      k     f_k · N   m   |U|          γ · N
retail       100   1192      1   16,470       5768
mushroom     100   4464      2   7,104        5433
pumsb-star   200   28613     3   1.5 × 10^9   21235
kosarak      200   14142     2   8.5 × 10^8   20733
AOL          200   12450     1   2.3 × 10^6   16038

(b) Effectiveness of the TF approach due to
Bhaskar et al., when applied to selecting top k
itemsets that are of length at most m. When the
column γ · N is larger than f_k · N, the truncated
frequency approach is completely ineffective.

Table 2: Effectiveness of the TF approach, and dataset parameters



a dozen. However, for many datasets, when recovering the 
top k itemsets for k = 100, k = 200, or even larger k, one 
needs to go beyond the first dozen or so most frequent items. 
Thus one needs to choose more than one set of items (i.e.,
dimensions) for projections. How to select these in a
differentially private fashion, and how to best utilize information
obtained from them are the main challenges that need to be
solved by the PrivBasis method.

Formalizing the above intuitions, we introduce the concept
of θ-basis sets of a transaction dataset.

Definition 2 (θ-Basis Set). Given a transaction
dataset D over items in I, and a threshold θ, we say that
B = {B_1, B_2, ..., B_w}, where B_i ⊆ I for 1 ≤ i ≤ w, is a
θ-basis set for D, if and only if for any θ-frequent itemset
X ⊆ I, there exists B_i ∈ B, such that X ⊆ B_i. We say that
B_i covers X.

We call w the width of the basis set, ℓ = max_{1≤i≤w} |B_i|
the length of the basis set, and each B_i a basis.

Given a dataset D and a θ-basis set B for it, we can privately
reconstruct with reasonable accuracy the frequencies of all
itemsets in the following candidate set C(B).

Definition 3 (Candidate Set). The candidate set
given a θ-basis set B = {B_1, B_2, ..., B_w} is defined as

\[ C(B) = \bigcup_{i=1}^{w} \{ X \mid X \subseteq B_i \} \]

That is, the candidate set C(B) is the set of all itemsets
that are covered by some basis B_i in B. When B is a θ-basis
set, all θ-frequent itemsets are in C(B).
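
The candidate set can be enumerated directly from its definition. The following Python sketch uses a hypothetical helper `candidate_set` and a made-up basis set; note that the definition as stated also admits the empty itemset, and that an itemset covered by several bases appears only once.

```python
from itertools import combinations


def candidate_set(basis_set):
    """C(B): every itemset that is a subset of at least one basis."""
    candidates = set()
    for basis in basis_set:
        items = sorted(basis)
        for size in range(len(items) + 1):
            for subset in combinations(items, size):
                candidates.add(frozenset(subset))
    return candidates


if __name__ == "__main__":
    bases = [{"a", "b", "c"}, {"c", "d"}]
    cands = candidate_set(bases)
    # 8 subsets of {a, b, c} plus 4 of {c, d}, minus the 2 shared ones.
    print(len(cands))
```

The size of C(B) is at most w · 2^ℓ, which is why both the width w and the length ℓ of the basis set matter.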

At a high level, the PrivBasis method consists of the fol- 
lowing steps. 

1. Obtain λ, the number of unique items that are involved
in the k most frequent itemsets.

2. Obtain F, the λ most frequent items among I. The
desired goal (which can be approximately achieved) is
that F includes exactly the items that appear in the
top k itemsets.

3. Obtain P, a set of the most frequent pairs of items
among F. The desired goal is that P includes exactly
the pairs of items that appear in the top k itemsets.

4. Construct B, using F and P. The desired goal is that
B is an f_k-basis set with small width and length.

5. Obtain noisy frequencies of itemsets in C(B); one can
then select the top k itemsets from C(B).

In the rest of this section, we present details of these steps. 
We do this in a reverse order, first presenting Step 5 in Sec- 
tion 4.2, then Step 4 in Section 4.3, and finally the complete 
algorithm, including details of Steps 1 to 3 in Section 4.4. 

4.2 Generating Noisy Counts for C(B)

Algorithm 1 gives the BasisFreq algorithm for comput- 
ing the noisy counts of all itemsets in C(B). In the algo- 
rithm we compute noisy counts, which can be translated 
into frequencies easily. The key ideas of the algorithm are 
as follows. Each basis B_i divides all possible transactions
into 2^|B_i| mutually disjoint bins, one corresponding to each
subset of B_i. For each X ⊆ B_i, the bin corresponding to X
consists of all transactions that contain all items in X, but
no item in B_i \ X.

Given a basis set B, adding noise Lap(w/ε) to each bin
count and outputting these noisy counts satisfies ε-differential
privacy. For each basis B_i, adding or removing a single
transaction can affect the count of exactly one bin by exactly
1. Hence the sensitivity of publishing all bin counts for one
basis is 1; and the sensitivity for publishing counts for all
bases is w. In Algorithm 1, lines 2 to 11 compute these noisy
bin counts.

From these noisy bin counts, one can then recover the counts
of all itemsets in C. For example, a basis {a, b, c} divides all
transactions into 8 bins: {¬a, ¬b, ¬c} (not containing any
of a, b, c), {a, ¬b, ¬c}, ..., {a, b, c}. The count of the itemset
{a, b} can then be obtained by summing up the counts
for the two bins {a, b, ¬c} and {a, b, c}. Lines 12 to 26 in
Algorithm 1 compute the noisy counts for itemsets in C.
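
The binning and the recovery of itemset counts can be sketched in a few lines of Python. This is an illustration for a single basis, not a transcription of Algorithm 1. The `noise` parameter is a hypothetical hook: it defaults to zero noise so that the example is deterministic, and in a private run it would be replaced by a sampler for Lap(w/ε). The toy transactions are invented.

```python
from itertools import combinations


def bin_counts(transactions, basis, noise=lambda: 0.0):
    """For each X ⊆ basis, count transactions t with t ∩ basis == X."""
    basis = frozenset(basis)
    bins = {}
    for size in range(len(basis) + 1):
        for subset in combinations(sorted(basis), size):
            # Each bin starts from one independent noise draw.
            bins[frozenset(subset)] = noise()
    for t in transactions:
        bins[frozenset(t) & basis] += 1
    return bins


def itemset_count(bins, itemset):
    """Sum the bins whose label is a superset of the itemset."""
    itemset = frozenset(itemset)
    return sum(count for label, count in bins.items() if itemset <= label)


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}, {"e"}]
    bins = bin_counts(data, {"a", "b", "c"})
    # {a, b} is recovered from the bins {a, b, ¬c} and {a, b, c}.
    print(itemset_count(bins, {"a", "b"}))
```

Because every transaction falls into exactly one bin per basis, the dataset is scanned once and all subset counts are then derived without touching the data again.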

Theorem 1. Algorithm 1 is ε-differentially private.

Proof. The only part in Algorithm 1 that depends on
the dataset is computing the noisy bin counts b[i][X]. As
discussed above, publishing all bin counts has sensitivity
w. Line 4 adds Laplacian noise to satisfy ε-differential
privacy taking this sensitivity into consideration. Starting
from line 12, the algorithm only performs post-processing,
and does not access D again. □

Running Time. We now analyze the running time of 
Algorithm 1. The algorithm has four parts. The first part

Algorithm 1 BasisFreq: Privately Releasing Frequent
Itemsets using Basis Sets

Input: Transactional dataset D, B = {B_1, ..., B_w}, k,
differential privacy budget ε.
Output: Top k frequent itemsets in C and their frequencies.

 1: function BasisFreq(D, B = {B_1, ..., B_w}, k, ε)
 2:   for i = 1 → w do
 3:     for j = 0 → 2^|B_i| − 1 do
 4:       b[i][j] ← Lap(w/ε)
 5:     end for
 6:   end for
 7:   for all t ∈ D do
 8:     for i = 1 → w do
 9:       b[i][t ∩ B_i] ← b[i][t ∩ B_i] + 1
10:     end for
11:   end for
12:   for i = 1 → w do
13:     for all X ⊆ B_i do
14:       nv ← 2^(|B_i| − |X|)
15:       nc ← Σ_{Y : X ⊆ Y ⊆ B_i} b[i][Y]
17:       if C(X) is undefined then
18:         C(X).nc ← nc
19:         C(X).v ← nv
20:       else
21:         v ← C(X).v
22:         C(X).nc ← (nv/(v + nv)) · C(X).nc + (v/(v + nv)) · nc
23:         C(X).v ← (v · nv)/(v + nv)
24:       end if
25:     end for
26:   end for
27:   R ← the k elements in X's with highest C(X).nc
28:   return R
29: end function



Comments

• The array element b[i][X] stores the noisy count of
the bin corresponding to itemset X and basis B_i. We
point out that a subset of B_i is easily converted to a
binary number of |B_i| bits, which is an integer index
in 0..2^|B_i| − 1.

• The rationale behind lines 21-23 is explained in
Section 4.2 in the "Accuracy Analysis" paragraph.



(lines 2 to 6) initializes the array b, which takes O(w 2^ℓ)
time. The second part (lines 7 to 11) scans the dataset D
and matches each transaction with each basis, and takes
time O(w|D|), where |D| is the sum of the lengths of all
transactions in D. The third part (lines 12 to 26) computes
the noisy counts of itemsets in the candidate set C. This
part's runtime is dominated by line 15, which for each X ⊆
B_i requires O(2^(|B_i|−|X|)) operations. The total number of
operations involved for each basis B_i is

\[ \sum_{j=1}^{|B_i|} \binom{|B_i|}{j} 2^{|B_i|-j} = 3^{|B_i|} - 2^{|B_i|} \]

Thus the third part takes time O(w 3^ℓ). The last part
(line 27) sorts the noisy counts of itemsets in C to select the
top k and takes time O(wℓ(log w)2^ℓ). Thus Algorithm 1 has
time complexity O(w|D| + w 3^ℓ). For large datasets, we will
have ℓ < log_3 |D|, and the running time is dominated by
O(w|D|). This analysis shows that w is a linear factor on
the running time, while ℓ has an exponential effect. In our
experiments we limit ℓ to be at most 12, and often 10 or
smaller.
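
The operation count for one basis can be verified by brute force. The sketch below, with a hypothetical function name, sums 2^(|B_i|−|X|) over the subsets X of a basis of a given size. Over the non-empty subsets the total is 3^n − 2^n, and including the empty itemset as well gives 3^n; either way the cost per basis grows as 3^ℓ.

```python
from itertools import combinations


def superset_sum_operations(basis_size, include_empty=False):
    """Total bins summed over all subsets X of a basis with `basis_size` items."""
    items = range(basis_size)
    smallest = 0 if include_empty else 1
    total = 0
    for size in range(smallest, basis_size + 1):
        for _ in combinations(items, size):
            # Recovering the count of X touches 2^(n - |X|) bins.
            total += 2 ** (basis_size - size)
    return total


if __name__ == "__main__":
    for n in (3, 6, 10, 12):
        print(n, superset_sum_operations(n), 3 ** n - 2 ** n)
```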

Accuracy Analysis. We now analyze the accuracy of 
the noisy frequencies obtained via Algorithm 1. Let nf_i(X)
denote the noisy frequency of an itemset X from a basis B_i.
We use the Error Variance as the measure of accuracy. That
is, we consider

\[ EV[nf_i(X)] = \mathrm{Var}\left( |nf_i(X) - f(X)| \right) \]

When computing nf_i(X), one sums up 2^(|B_i|−|X|) noisy
counts, each with noise independently generated according
to Lap(w/(εN)), which has variance 2w²/(ε²N²). We thus have:

\[ EV[nf_i(X)] = 2^{|B_i|-|X|+1} \frac{w^2}{\varepsilon^2 N^2} \qquad (4) \]



When two or more bases overlap, some itemsets may be
subsets of more than one basis, and one thus obtains multiple
noisy counts of such an itemset, one from each basis
including the itemset. In this case, these counts can be
combined to obtain a more accurate count. Given two noisy
counts of the itemset X, nf_1 with error variance v_1 and nf_2
with error variance v_2, the optimal way to combine them
is to use (v_2/(v_1+v_2)) nf_1 + (v_1/(v_1+v_2)) nf_2, resulting in
error variance v_1 v_2/(v_1+v_2). This weighted averaging is done
in lines 22 and 23.
From Equation (4), it is easy to see that the worst-case
error variance among all X and all B_i is 2^ℓ w²/(ε²N²). To
minimize such worst-case error variance, one wants to minimize
w² 2^ℓ.
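
The inverse-variance weighting used to merge two noisy estimates of the same itemset can be written out as a short Python sketch. The function name `combine` and the numbers in the example are hypothetical; the formulas are the ones stated above, with each estimate weighted by the other estimate's variance.

```python
def combine(nf1, v1, nf2, v2):
    """Merge two independent noisy estimates with error variances v1 and v2.

    Returns the weighted average and its error variance v1*v2/(v1+v2).
    """
    total = v1 + v2
    estimate = (v2 / total) * nf1 + (v1 / total) * nf2
    variance = (v1 * v2) / total
    return estimate, variance


if __name__ == "__main__":
    # The less noisy estimate (variance 1.0) receives the larger weight.
    print(combine(10.0, 1.0, 20.0, 3.0))
```

The combined variance is always smaller than either input variance, so an itemset covered by several bases is estimated more accurately than from any single basis.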

Alternatively, one may want to minimize average-case er- 
ror variance. Given a basis set B, and a set Q of itemsets, 
we can compute the average-case error variance for using B 
to obtain noisy frequencies for itemsets in Q as follows. For 
each itemset X ∈ Q, if it is covered by a single basis B_i, then
the error variance can be computed using Equation (4). If X
is covered by more than one basis, one can also compute
the error variance of the weighted average method. One can
then take an average of the computed error variances for all
itemsets in Q.

We now consider a special case where one wants to ob- 
tain noisy frequencies for a set Q of k individual items, and 
show what basis set minimizes both the worst-case and the 
average-case error variance. One extreme is to use one basis 
for each item in Q. As a result, one adds Laplacian noise
to the k counts, and has sensitivity k. The noise has
distribution Lap(k/ε), resulting in error variance of k²V, where
V = 2/ε². Now assume that the k items are divided into bases
of size ℓ, then we have w = k/ℓ bases. The noise variance
for the frequency of each item is thus

\[ \frac{2^{\ell-1}}{\ell^2} k^2 V. \]

Note that 2^(ℓ−1)/ℓ² is minimized at ℓ = 3, where it equals 4/9.
Thus one obtains more than half reduction in the error
variance when compared with the direct method.
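
The optimal basis size for individual items can be checked by evaluating the variance factor 2^(ℓ−1)/ℓ² for small ℓ. The sketch below uses exact rational arithmetic; the function name `variance_factor` is hypothetical, and the search is limited to ℓ ≤ 12, matching the upper limit on ℓ mentioned for the experiments.

```python
from fractions import Fraction


def variance_factor(length):
    """Factor 2^(l-1) / l^2 that multiplies k^2 * V for bases of size l."""
    return Fraction(2 ** (length - 1), length ** 2)


def best_basis_length(max_length=12):
    """Basis size in 1..max_length with the smallest variance factor."""
    return min(range(1, max_length + 1), key=variance_factor)


if __name__ == "__main__":
    for length in range(1, 7):
        print(length, variance_factor(length))
    print("best:", best_basis_length())
```

A factor of 1 at ℓ = 1 corresponds to the direct method of one basis per item; the factor falls to 1/2 at ℓ = 2, reaches 4/9 at ℓ = 3, and rises again from ℓ = 4 onward.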

4.3 Constructing Basis Sets

We now discuss how to construct a good θ-basis set for
θ = f_k, the frequency of the k-th most frequent itemset.

Given a transaction dataset D and a threshold θ, many
θ-basis sets exist. As discussed above, we desire θ-basis sets
that have both small width and small length. We now
investigate some properties of θ-basis sets, which help us
construct θ-basis sets.

Proposition 2. The set B_1 = {{x_1, ..., x_λ}}, where
x_1, ..., x_λ are all the items that are θ-frequent, is a θ-basis
set of D of width 1 and length λ.

Proof. For any θ-frequent itemset X, all items in X
must be θ-frequent, and therefore X ⊆ {x_1, ..., x_λ}. □

We use λ to denote the number of unique items in the top
k itemsets. When λ is small, this basis set works
fine. However, when λ is large, we need other methods.
Below we explore additional properties of θ-basis sets.

Proposition 3. The set B_m = {B_1, B_2, ..., B_w}, where
B_1, B_2, ..., B_w are all the maximal θ-frequent itemsets, is a
θ-basis set. Furthermore, any θ-basis set must have length
at least as large as the length of B_m.

Proof. Recall that a maximal frequent itemset is a θ-frequent
itemset such that any proper superset of it is not θ-frequent.
The length of this θ-basis set equals the size of
the largest maximal θ-frequent itemset. This θ-basis set has
the smallest length among all θ-basis sets, because every
θ-basis set must include a basis that is a superset of the largest
maximal θ-frequent itemset. □

Proposition 4. Given a θ-basis set B =
{B_1, B_2, ..., B_w}, merging any B_i and B_j (i.e., replacing
B_i and B_j with B_i ∪ B_j in B) results in another
θ-basis set of width w − 1.

Propositions 3 and 4 together give one way to construct 
good basis sets. Given the maximal frequent itemsets, one 
can merge them as needed. The challenge is that it is un- 
clear how to publish the set of all maximal frequent itemsets 
while satisfying differential privacy. Below we show a way to 
privately over-approximate the maximal frequent itemsets. 

Definition 4. Let F be the set of θ-frequent items, and
P be the set of all θ-frequent pairs of items. Observe that P
involves only items in F. We define the θ-frequent pairs
graph to be the graph where each node corresponds to an
item in F and each edge corresponds to a frequent pair in
P.

We are interested in the maximal cliques in
the θ-frequent pairs graph. A maximal clique, sometimes called
inclusion-maximal, is a clique that is not included in a
larger clique. The classic algorithm for finding all maximal
cliques is the Bron-Kerbosch Algorithm [12], which is
widely used in application areas of graph algorithms.

Proposition 5. Given D, the set of all maximal cliques
in D's θ-frequent pairs graph forms a θ-basis set of D.

Proof. For any θ-frequent item x, x ∈ F belongs to some
maximal clique. For any θ-frequent itemset X of size ≥ 2,
by the Apriori principle all items in X and all pairs of items
in X must be θ-frequent; thus X corresponds to a clique in
the θ-frequent pairs graph, which must be included in some
maximal clique. □



The above proof also shows that each θ-frequent itemset
must be a subset of some maximal clique of the θ-frequent
pairs graph; however, a maximal clique may not be a
θ-frequent itemset. For example, it may be the case that
pairs {1, 2}, {2, 3}, {1, 3} are all θ-frequent, but the itemset
{1, 2, 3} is not θ-frequent. Thus, the maximal cliques
over-approximate the maximal frequent itemsets.
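
The construction of a basis set from maximal cliques can be illustrated with the basic Bron-Kerbosch recursion (without pivoting). This Python sketch is only an illustration under stated assumptions: the function name `maximal_cliques` is hypothetical, the items and pairs are invented, and the frequent items and pairs are assumed to have been obtained already.

```python
def maximal_cliques(items, pairs):
    """Enumerate maximal cliques of the graph with nodes `items`, edges `pairs`."""
    neighbors = {x: set() for x in items}
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    cliques = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when no vertex can extend it.
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(items), set())
    return cliques


if __name__ == "__main__":
    frequent_items = [1, 2, 3, 4, 5]
    frequent_pairs = [(1, 2), (2, 3), (1, 3), (3, 4)]
    for clique in maximal_cliques(frequent_items, frequent_pairs):
        print(sorted(clique))
```

In this example {1, 2, 3} is reported as a basis whether or not the itemset {1, 2, 3} itself is frequent, which is the over-approximation described above. An item with no frequent pair, such as 5 here, comes out as a clique of size 1.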

We use Propositions 5 and 4 to construct a basis set.
Algorithm 2 gives the algorithm for constructing a basis set that
covers all maximal cliques in the graph constructed from F
and P, while attempting to minimize the average-case error
variance (EV) for pairs in P and items in F \ P, which we
use to denote items that appear in F, but not in P.

The algorithm starts with a basis set that has two parts:
B_1 includes the maximal cliques of size at least 2; B_2
includes items in F \ P grouped into itemsets of size 3 each,
with possibly 1 or 2 items left. The algorithm then greedily
merges bases in B_1, to reduce the EV. After this step, the
algorithm tries to remove some bases in B_2 and distribute
the items in them elsewhere, if doing so reduces the EV.



Algorithm 2 ConstructBasisSet: Construct a Basis Set
Using Frequent Items and Pairs
Input: F, frequent items, and P, frequent pairs.
Output: B, a basis set covering all maximal cliques in
the graph (F, P).

function ConstructBasisSet(F, P)
    B_1 ← all maximal cliques of size at least 2 in the
        graph given by P
    B_2 ← items in F but not in P, divided into the
        smallest number of itemsets such that each contains
        at most 3 items
    Repeatedly find B_i, B_j ∈ B_1 such that merging B_i
        and B_j results in the largest reduction of average-case
        error variance (EV) when using B = B_1 ∪ B_2 to obtain
        frequencies of itemsets in F and P; update B_1 by
        merging B_i, B_j; stop when no merging reduces EV
    Repeatedly find B_i ∈ B_2 such that removing B_i and
        moving items in B_i to bases in B_1 ∪ B_2 with smallest
        sizes results in the largest EV reduction; update B when
        B_i is found; stop when no such B_i can be found
    return B = B_1 ∪ B_2
end function
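The starting point of Algorithm 2 (the construction of the two initial parts of the basis set) can be sketched as follows. This is only an illustration: the two greedy EV-reduction loops are omitted, the maximal cliques are assumed to be given as input, and the names initial_basis_set and covers are hypothetical. The helper covers checks the property the basis set must satisfy, namely that every item in F and every pair in P is a subset of some basis.

```python
def initial_basis_set(F, P, cliques):
    """Build the two initial parts of the basis set.

    `cliques` is assumed to hold the maximal cliques of the graph (F, P).
    """
    in_pairs = {item for pair in P for item in pair}
    # First part: maximal cliques of size at least 2.
    b1 = [frozenset(c) for c in cliques if len(c) >= 2]
    # Second part: items in no frequent pair, grouped three at a time.
    leftover = sorted(set(F) - in_pairs)
    b2 = [frozenset(leftover[i:i + 3]) for i in range(0, len(leftover), 3)]
    return b1, b2


def covers(basis_set, F, P):
    """True if every item in F and every pair in P lies inside some basis."""
    targets = [frozenset([x]) for x in F] + [frozenset(p) for p in P]
    return all(any(t <= b for b in basis_set) for t in targets)


# Hypothetical input: items 5..9 appear in no frequent pair.
F = list(range(1, 10))
P = [(1, 2), (1, 3), (2, 3), (3, 4)]
cliques = [{1, 2, 3}, {3, 4}]
b1, b2 = initial_basis_set(F, P, cliques)
is_cover = covers(b1 + b2, F, P)
```

Here the second part consists of {5, 6, 7} and {8, 9}, matching the description of groups of size 3 with possibly 1 or 2 items left over.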



4.4 Putting Things Together for PrivBasis 

Now we are able to put all the pieces together for the 
PrivBasis method. The algorithm is given in Algorithm 3. 

Recall that the algorithm has five steps, as given in
Section 4.1: (1) Get $\lambda$; (2) Get frequent items; (3) Get
frequent pairs; (4) Construct the basis set; (5) Get noisy
counts.

Privacy Budget Allocation. The privacy budget $\epsilon$ must
be divided among Steps 1, 2, 3, and 5. Step 4 does not access
the dataset D, and only processes the outputs of earlier
steps. We divide the privacy budget into three portions:
$\alpha_1 \epsilon$ is used for Step 1 (obtaining $\lambda$), $\alpha_2 \epsilon$ is used for Steps
2 and 3 combined, and $\alpha_3 \epsilon$ is used for Step 5. In our
experiments, we chose $\alpha_1 = 0.1$, $\alpha_2 = 0.4$, and $\alpha_3 = 0.5$ for all
datasets. These choices were not tuned, and may not be
optimal; it appears that the optimal allocation depends on
characteristics of the dataset D and the value k.

Algorithm 3 PrivBasis: Privately Releasing Frequent Itemsets
Input: Transactional dataset D, items I, k, differential
privacy budget ε.
Algorithmic Parameters: α_1 + α_2 + α_3 = 1 decides
what proportions of the privacy budget are allocated to
the different steps. We use α_1 = 0.1, α_2 = 0.4, α_3 = 0.5.
Parameter η, which we set at either 1.1 or 1.2, is the safety
margin parameter.

function PrivBasisMain(D, I, k, ε)
    λ ← GetLambda(D, I, k, α_1·ε)
    if λ ≤ 12 then
        F ← GetFreqElements(D, I, λ, α_2·ε)
        return BasisFreq(D, {F}, k, (1 − α_1 − α_2)·ε)
    else
        λ_2 ← η·k − λ
        λ_2 ← λ_2 / sqrt(max(1, λ_2/λ))
        β_1 ← α_2·λ/(λ + λ_2)
        β_2 ← α_2 − β_1
        F ← GetFreqElements(D, I, λ, β_1·ε)
        U ← all pairs of items in F
        P ← GetFreqElements(D, U, λ_2, β_2·ε)
        B ← ConstructBasisSet(F, P)
        return BasisFreq(D, B, k, (1 − α_1 − α_2)·ε)
    end if
end function

function GetLambda(D, I, k, ε)
    N ← number of transactions in D
    k_1 ← ⌈k·η⌉
    θ ← frequency of the k_1'th most frequent itemset
    for i = 1 → |I| do
        f ← frequency of the i'th most frequent item
        p[i] ← exp((1 − |f − θ|)·N·ε/2)
    end for
    λ ← sample i ∈ {1, ..., |I|} according to p[i]
    return λ
end function

function GetFreqElements(D, U, λ, ε)
    N ← number of transactions in D
    for i = 1 → |U| do
        f ← frequency of U[i]
        p[i] ← exp(f·N·ε/λ)
    end for
    for i = 1 → λ do
        X[i] ← sample from U according to p
        remove X[i] from U
    end for
    return X
end function

Step 3 is needed only when $\lambda$, the number of unique items
that appear in the top k frequent itemsets, is greater than 12. Recall
that when $\lambda \leq 12$, we construct B to consist of a single
basis that includes the $\lambda$ most frequent items, and do not
need Step 3 to obtain frequent pairs. When $\lambda$ is small, we
let Step 2 use the whole of $\alpha_2 \epsilon$. When Step 3 is needed, the
privacy budget $\alpha_2 \epsilon$ must be allocated between Steps 2 and
3. This allocation is done according to how many frequent
items and pairs we want to get. To obtain the $\lambda$ most frequent
items and the $\lambda_2$ most frequent pairs, Step 2 gets a
$\lambda/(\lambda + \lambda_2)$ portion of $\alpha_2 \epsilon$, and Step 3 gets the rest.
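The allocation rule can be written out as a short calculation. The sketch below is an illustration only; the function name allocate_budget and the dictionary keys are hypothetical, while the default proportions (0.1, 0.4, 0.5) and the threshold 12 are the values stated in the text.

```python
def allocate_budget(epsilon, lam, lam2, alphas=(0.1, 0.4, 0.5), threshold=12):
    """Split the privacy budget epsilon among Steps 1, 2, 3 and 5.

    lam is the number of frequent items to select, lam2 the number of
    frequent pairs (ignored when lam does not exceed the threshold).
    """
    a1, a2, a3 = alphas
    budget = {"step1": a1 * epsilon, "step5": a3 * epsilon}
    if lam <= threshold:
        # Single-basis case: Step 3 is skipped, Step 2 uses all of a2 * epsilon.
        budget["step2"] = a2 * epsilon
        budget["step3"] = 0.0
    else:
        # Split a2 * epsilon in proportion to the number of items and pairs.
        beta1 = a2 * lam / (lam + lam2)
        budget["step2"] = beta1 * epsilon
        budget["step3"] = (a2 - beta1) * epsilon
    return budget


example = allocate_budget(epsilon=1.0, lam=20, lam2=44)
```

With $\epsilon = 1$, $\lambda = 20$ and $\lambda_2 = 44$, Step 2 receives $0.4 \cdot 20/64 = 0.125$ and Step 3 receives the remaining 0.275.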

Step 1: Get $\lambda$. Step 1 is done using the GetLambda
function in Algorithm 3. Intuitively, one can use the
exponential method to sample j from {1, 2, ..., k} with the
following quality function.

$$q(D, j) = \left(1 - \left|f_k - f_{item_j}\right|\right) \cdot N$$

where $f_k$ is the frequency of the k'th most frequent itemset
and $f_{item_j}$ is the frequency of the j'th most frequent item.
That is, we want to choose j such that the j'th most frequent
item has frequency closest to that of the k'th most frequent
itemset.

The sensitivity of the above quality function is 1, because
adding or removing a transaction can affect $f_k$ by
at most 1/N and $f_{item_j}$ by at most 1/N. Furthermore, $f_k$
and $f_{item_j}$ cannot change in different directions (i.e., one
increases while the other decreases).

In Algorithm 3, rather than using $f_k$ in the above quality
function, we use $f_{k_1}$, where $k_1 = \lceil k \cdot \eta \rceil$, and $\eta$ is a safety
margin parameter that we set at either 1.1 or 1.2, depending on
k. The reason for doing this is to avoid the error in which
the obtained $\lambda$ is too small, because then we may miss a
significant number of the top k itemsets when the basis set is
constructed from the top $\lambda$ items. When the obtained $\lambda$ is slightly larger
than the correct value, this just causes the privacy budget
to be divided somewhat thinner, an effect we can tolerate
better.
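A minimal Python sketch of this sampling step is given below. It assumes that the item frequencies (sorted in decreasing order) and the frequency of the $k_1$'th most frequent itemset have already been computed from D; the function name get_lambda and its parameters are hypothetical. Subtracting the maximum quality before exponentiating is a numerical-stability device added in this sketch and leaves the sampling distribution unchanged.

```python
import math
import random


def get_lambda(item_freqs, f_k1, n, epsilon, rng=random):
    """Sample lambda with the exponential method.

    item_freqs: item frequencies sorted in decreasing order.
    f_k1: frequency of the k_1'th most frequent itemset.
    n: number of transactions; epsilon: budget for this step.
    """
    # Quality of choosing j: (1 - |f_k1 - f_item_j|) * N, sensitivity 1.
    qualities = [(1.0 - abs(f_k1 - f)) * n for f in item_freqs]
    q_max = max(qualities)
    # Weights proportional to exp(epsilon * q / 2); shifted for stability.
    weights = [math.exp(epsilon * (q - q_max) / 2.0) for q in qualities]
    return rng.choices(range(1, len(item_freqs) + 1), weights=weights, k=1)[0]


# Hypothetical frequencies; the third item is closest to f_k1 = 0.41.
lam = get_lambda([0.9, 0.6, 0.4, 0.1], f_k1=0.41, n=10000, epsilon=1.0)
```

With N this large, the weights of all other indices underflow to zero, so the sketch returns 3 for this input; for small N or small $\epsilon$ the output is genuinely random.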

Steps 2 and 3: Get frequent items and pairs. Both 
Steps 2 and 3 use the GetFreqElements function, which 
privately selects a number of itemsets with highest frequen- 
cies from a set U. It uses repeated sampling without re- 
placement, where each sampling step uses the exponential 
method with the frequency of each itemset as its quality. 

In Step 2, we are selecting from all items in I; thus the
candidate set size is |I|, and the resulting set is F. In Step 3, we
only need to select pairs of items in F; thus, the set U from
which we are selecting has only $\binom{\lambda}{2}$ elements, which is quite
small.
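The repeated sampling without replacement can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the name get_freq_elements and its parameters are hypothetical, each of the $\lambda$ draws is assumed to use budget $\epsilon/\lambda$, and the weight of a candidate is taken to be $\exp(f \cdot N \cdot \epsilon/\lambda)$, where $f \cdot N$ is its support count. A more conservative variant would halve the exponent.

```python
import math
import random


def get_freq_elements(candidates, freq, n, lam, epsilon, rng=random):
    """Select lam candidates by repeated exponential-method sampling.

    candidates: list of items or itemsets; freq: dict mapping each
    candidate to its frequency; n: number of transactions.
    """
    remaining = list(candidates)
    selected = []
    eps_each = epsilon / lam  # assumed per-draw share of the budget
    for _ in range(min(lam, len(remaining))):
        scores = [freq[c] * n * eps_each for c in remaining]
        s_max = max(scores)
        # Shift by the maximum score so that exp() cannot overflow.
        weights = [math.exp(s - s_max) for s in scores]
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        selected.append(remaining.pop(idx))
    return selected


# Hypothetical candidates with well-separated frequencies.
freq = {"a": 0.5, "b": 0.4, "c": 0.1, "d": 0.05}
top2 = get_freq_elements(["a", "b", "c", "d"], freq, n=100000, lam=2, epsilon=1.0)
```

Because the support counts in this example are far apart, the two most frequent candidates are selected with overwhelming probability; with closer counts or a smaller budget the selection becomes noisy, which is the effect discussed in the text.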

When determining $\lambda_2$, the number of frequent pairs in the
top k itemsets, the naive method is to set $\lambda_2 = \eta \cdot k - \lambda$.
This, however, is not ideal. In Table 2(b), we see that for the
Pumsb_star dataset, the top 100 itemsets include 17 items
and 31 pairs. We desire a $\lambda_2$ value that is larger than 31, but
not too large. Setting $\lambda_2 = \eta \cdot k - \lambda$ results in a value close to 100
for $\eta = 1.2$. Obtaining the top 100 pairs and constructing a basis set
to cover them is inaccurate, both because each pair must
be selected with less privacy budget, and because having to
cover 100 pairs results in larger bases. While the best value
of $\lambda_2$ depends on the dataset, we use the following heuristic
formula.



$$\lambda_2 = \frac{\lambda_2'}{\sqrt{\max(1, \lambda_2'/\lambda)}}, \quad \text{where } \lambda_2' = \eta \cdot k - \lambda$$



The intuition is that when the ratio $\lambda_2'/\lambda$ is large,
we expect a significant proportion of the remaining $\lambda_2'$ itemsets
to be non-pairs, so we divide $\lambda_2'$ by the square root of the
ratio. For the Pumsb_star dataset, when the noisy $\lambda = 20$,
the $\lambda_2$ value computed as above equals 44.
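The heuristic can be verified with a short calculation. In the sketch below, the function name lambda2 is hypothetical, and rounding down to an integer is an assumption made here because it reproduces the value 44 quoted above for k = 100, $\eta = 1.2$ and $\lambda = 20$ (the unrounded value is $100/\sqrt{5} \approx 44.7$).

```python
import math


def lambda2(k, lam, eta):
    """Heuristic number of frequent pairs to select."""
    naive = eta * k - lam              # the naive choice eta * k - lambda
    ratio = max(1.0, naive / lam)      # no shrinking when the ratio is at most 1
    return int(naive / math.sqrt(ratio))  # assumption: round down


pairs_to_select = lambda2(k=100, lam=20, eta=1.2)
```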

As every data-dependent step in Algorithm 3 satisfies
differential privacy, and the privacy budgets allocated to
these steps sum to $\epsilon$, we have the following theorem.

Theorem 6. Algorithm 3 is $\epsilon$-differentially private.

5. EXPERIMENTS 

In this section, we demonstrate the efficacy of our ap- 
proach through extensive experimental evaluation on a num- 
ber of real datasets. We begin by describing the datasets we 
use as well as the utility measures we employ. We then 
present our experimental results. 

Datasets. To facilitate experimental evaluation, we run our
algorithm on the following 5 datasets. Table 2(b)
provides the number of transactions in each
dataset, the number of distinct items, as well as the average
transaction length.

• Retail Dataset [2]. This is retail market basket data
from an anonymous Belgian retail store. Each transaction
in the dataset is the set of items in one receipt,
and there are 88,162 receipts in total.

• Mushroom Dataset [2]. In this dataset, each record de- 
scribes the physical attributes such as color of a single 
mushroom. 

• AOL Search Log Dataset [1]. Each line of this dataset
contains the randomly assigned userID, the search
query string, the time stamp and the clicked URL. We
preprocess the logs by removing the stop words and
performing word stemming. By treating each query
keyword as an item, we transform the preprocessed
dataset into a transaction dataset by grouping the
search query keywords of the same userID into one
transaction.

• Pumsb_star Dataset [2]. The Pumsb dataset is census
data from PUMS (Public Use Microdata Sample).
Pumsb_star represents a subset of this dataset suitable
for data mining purposes.

• Kosarak Dataset [2]. Kosarak contains the click- 
stream data of a Hungarian online news website. Each 
transaction is a click stream from a user. 

Utility Measures. We evaluate the utility by employing
the following standard metrics.

• False negative rate: This measures the fraction of actual
frequent itemsets that do not appear in the published
result.

$$\mathrm{FNR} = \frac{\text{FalseNegatives}}{k}$$

We point out that this is the same as the false positive
rate, the fraction of identified top k itemsets that are
not in the actual top k.

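• Relative error of published itemset counts: This measures
the error with respect to the actual itemset frequency
in the dataset. It is calculated as the median over all
published frequent itemsets X.

$$\mathrm{RE} = \operatorname{median}_X \frac{|\tilde{f}(X) - f(X)|}{f(X)}$$

Here $\tilde{f}(X)$ denotes the published frequency of X and
$f(X)$ its actual frequency in the dataset.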
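Both measures are straightforward to compute once the true top k itemsets and the published result are available. The following sketch is illustrative only; the function names and the example itemsets and frequencies are hypothetical, and every published itemset is assumed to have a non-zero actual frequency.

```python
from statistics import median


def false_negative_rate(true_top_k, published):
    """Fraction of the true top k itemsets missing from the published set."""
    true_top_k, published = set(true_top_k), set(published)
    return len(true_top_k - published) / len(true_top_k)


def relative_error(published_freqs, true_freqs):
    """Median relative error over all published itemsets."""
    return median(abs(noisy - true_freqs[x]) / true_freqs[x]
                  for x, noisy in published_freqs.items())


# Hypothetical example with k = 4.
a, b, c = frozenset({1}), frozenset({2}), frozenset({1, 2})
d, e = frozenset({3}), frozenset({4})
fnr = false_negative_rate([a, b, c, d], [a, b, c, e])
re = relative_error({a: 0.55, b: 0.40, c: 0.33},
                    {a: 0.50, b: 0.40, c: 0.30})
```

In this example one of the four true itemsets is missing, so the FNR is 0.25, and the three relative errors are approximately 0.1, 0 and 0.1, whose median is approximately 0.1.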



5.1 Experimental Results 

We compare the efficacy of our approach described in Al- 
gorithm 3 to the method in [8], which is described in Sec- 
tion 3. We use PB (for PrivBasis) to denote our method, 
and TF (for Truncated Frequency) to denote the method 
in [8]. 

Our algorithm is adaptive based on the nature of the
dataset involved and the value of k desired. More specifically,
the value of $\lambda$ determines how the basis set is selected.
We thus roughly divide our experiments into three groups
to demonstrate the efficacy of our algorithm under different
scenarios.

As we point out in Section 3, the TF algorithm becomes 
inaccurate, and, in some cases, cannot be applied for large 
values of m. Hence, we test different values of m and report 
the results for the value that provides the best precision. 

In our experiments, we vary $\epsilon$ and report the results. We
repeat all our experiments 3 times and report the mean of
the results as well as the standard error.

Small $\lambda$, single basis. When $\lambda$ is small, our PB method
uses a single basis with all the top $\lambda$ frequent items. We
are able to observe this scenario with the Mushroom and
Pumsb_star datasets with values of k less than 150. The
results are shown in Figure 1 and Figure 2. The results on
both datasets show that the PB method consistently and
significantly outperforms the TF method both in terms of
false negative rate and relative error. In fact, the performance
of PB with larger k significantly outperforms that of
TF with a smaller k.

For both datasets, the FNR for PB is close to 0 even when
$\epsilon$ is 0.5. In addition, the relative error is consistently small,
which indicates that we can get relatively high accuracy for
the released itemset counts. On the other hand, the TF
method has unacceptably large FNR both for larger k and
for smaller $\epsilon$. For example, for getting the top 100 itemsets
in the Mushroom dataset, TF has FNR at over 0.6 even
when $\epsilon = 1$; and for getting the top 150 itemsets in the
Pumsb_star dataset, TF has FNR at over 0.7 even when
$\epsilon = 1$. For obtaining the top 50 itemsets, at $\epsilon = 0.5$, TF
has FNR at about 0.6 and 0.4 for the two datasets. This
confirms our analysis in Section 3.

Larger $\lambda$, small number of bases. For larger and sparser
datasets, $\lambda$ can be large enough to make the construction of
a single basis infeasible. This is the case for the Retail and
Kosarak datasets. We run our experiments and construct
bases of length 7 each as described by our algorithm in the
previous section. The results for these datasets are shown in
Figure 3 and Figure 4. We again see that PB outperforms
TF. While the performance of PB is accurate even when
k = 400, the TF method has acceptable FNR only for k =
100 and $\epsilon$ > 0.5. An interesting observation here is that for
the Retail dataset, the FNR is worse than for the other datasets
on all accounts. Upon investigation, we realized that this is
mainly due to the nature of the dataset. For larger k, there
are many itemsets whose frequencies are lower than $f_k$ but
very close to $f_k$. Hence the ratio of the probability of selecting
the correct top k itemsets over the others is not large.

Figure 1: Results of the PrivBasis (PB) method and the Truncated Frequency (TF) method for the Mushroom
dataset, with k = 50 and k = 100; m is the maximum frequent itemset length that provides the highest precision
for TF.

Figure 2: Results of PB and TF for the Pumsb_star dataset, with k = 50 and k = 150; m is the maximum
frequent itemset length that provides the highest precision for TF.

Figure 3: Results of PB and TF for the Retail dataset, with k = 50 and k = 100; m is the maximum frequent
itemset length that provides the highest precision for TF.

Figure 4: Results for the Kosarak dataset, first row showing results for k = 100 and k = 200; second row
showing results for k = 300 and k = 400; m is the maximum frequent itemset length that provides the highest
precision for TF.

$\lambda \approx k$, large number of bases. For very sparse datasets,
such as search log datasets, the frequent itemsets
are largely dominated by frequent items. This is the case for
the AOL dataset, for which the top 200 frequent itemsets
contain 171 singletons and 29 pairs. For the TF method, it
is infeasible to run the algorithm for m > 1 and produce
accurate results since |I| is quite large. The results are shown
in Figure 5. This is the dataset where TF performs closest to
PB, because while TF degenerates into finding all frequent
singleton items, this can cover a large number of frequent
itemsets. And when the problem more or less degenerates
into finding the top k items, the advantage of PB over TF
is small.

6. RELATED WORK 

Differential privacy was presented in a series of papers
[15, 18, 9, 17, 16], and methods of satisfying it are presented
in [16, 30, 28]. Work on differential privacy initially
focused on answering statistical queries; however, recent
literature has focused on using differential privacy for data
publishing scenarios [10, 19]. Differential privacy has also
been employed to release contingency tables [7], publish
histograms [23, 33], and privately match records [25]. McSherry
and Mironov [27] present differentially private recommendation
algorithms in the Netflix Prize competition. More
recently, differential privacy has been adapted to release
accurate data mining models and results [19, 29, 8].

The existing work most related to ours is Bhaskar et al.
[8], which releases differentially private frequent itemsets.
We have discussed this approach in detail in Section 3.

Atzori et al. [6] investigated the problem of modifying the
supports of frequent itemsets, while concealing the sensitive
information of individuals. It requires that each pattern
derived from the released frequent itemsets have a support
of either 0 or at least k, a positive integer threshold. This
solution is based on the k-anonymity [31] privacy model, which is
a much weaker privacy notion than differential privacy.

Chen et al. [14] studied the release of transaction
datasets while satisfying differential privacy. They present
an algorithm that partitions the transaction dataset in a
top-down fashion guided by a context-free taxonomy tree,
and reports the noisy counts of the transactions at the
leaf level. This method generates a synthetic transaction
dataset, which can then be used to mine the top k frequent
itemsets. For the datasets we consider in this paper, this
method generates either an empty synthetic dataset or a
dataset that is highly inaccurate. An analysis of the method
shows that it can provide reasonable performance
only when the number of items is small. (One dataset used
in [14] for evaluation is the MSNBC dataset, which has 17
items and about 1 million transactions.)

The release of differentially private search logs,
including the AOL search log dataset, has been addressed
in [26] and [21]. These works differ from our work in that
they focus on releasing the top frequent keywords that occur
in the search logs, and do not release any information
about frequent itemsets of size 2 or higher. This is
essentially mining for frequent itemsets of length 1. Another
difference is that their approaches assume that the keywords
in the dataset are not public knowledge, whereas we assume
I is public. As a result, their approaches satisfy a relaxed
version of differential privacy similar to the notion of
$(\epsilon, \delta)$-differential privacy.

In addition, there is another series of works [32, 24, 20,
34, 13, 14] on publishing anonymized transaction data,
instead of releasing privacy-preserving mining results [19, 29,
8, 6]. Terrovitis et al. [32] apply a relaxation of k-anonymity
to transaction datasets, by requiring that for each itemset
with length at most m, the number of transactions in
the dataset containing this itemset is either 0 or at least k.
He and Naughton [24] enhance [32] by strictly imposing
k-anonymity. The two solutions [32, 24] treat each item in the
dataset equally. In contrast, the schemes in [20, 34]
divide items into sensitive and non-sensitive ones, and assume
that an adversary can only obtain background knowledge
about the non-sensitive items. The algorithms in [20, 34]
ensure that the confidence of inferring a sensitive item
from non-sensitive items is lower than a threshold. Cao et al. [13] relax the
assumption, and allow an attacker to include sensitive items
in his/her background knowledge. They provide a privacy
principle, $\rho$-uncertainty, which postulates that the confidence
of inferring a sensitive item from any itemset (consisting of
both sensitive and non-sensitive items) be lower than $\rho$, a
threshold.

7. CONCLUSION 

In this paper, we have introduced PrivBasis, a novel 
method of publishing frequent itemsets with differential pri- 
vacy guarantees. The intuition behind PrivBasis is simple. 
Given some minimum support threshold, $\theta$, one can construct
a basis set $B = \{B_1, B_2, \ldots, B_w\}$ such that any itemset
with frequency higher than $\theta$ is a subset of some basis $B_i$.
We have introduced techniques for privately constructing
basis sets, and for privately reconstructing the frequencies of
all subsets of the $B_i$'s with reasonable accuracy. One can then
select the most frequent itemsets from such reconstructed
subsets. We have conducted experiments on 5 real datasets
commonly used for frequent itemset mining purposes, and
the results show that our approach greatly outperforms the
current state of the art. Our approach can be viewed as a
dimension reduction to deal with the curse of dimensionality
in private data analysis and data anonymization.